
    An Evolutionary Algorithm with Crossover and Mutation for Model-Based Clustering

    An evolutionary algorithm (EA) is developed as an alternative to the EM algorithm for parameter estimation in model-based clustering. This EA facilitates a different search of the fitness landscape, i.e., the likelihood surface, utilizing both crossover and mutation. Furthermore, this EA represents an efficient approach to "hard" model-based clustering, and so it can be viewed as a sort of generalization of the k-means algorithm, which is itself equivalent to a restricted Gaussian mixture model. The EA is illustrated on several datasets, and its performance is compared to other hard clustering approaches and to model-based clustering via the EM algorithm.
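
    As a rough illustration of the idea (not the paper's exact algorithm), the sketch below evolves hard cluster assignments directly: candidate solutions are label vectors, fitness is the classification log-likelihood under per-cluster Gaussian fits, and new candidates arise via single-point crossover and random-reassignment mutation. The population size, mutation rate, and selection scheme are illustrative choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fitness(X, labels, G):
    """Classification log-likelihood under per-cluster Gaussian fits."""
    ll = 0.0
    for g in range(G):
        Xg = X[labels == g]
        if len(Xg) <= X.shape[1]:  # degenerate component: reject
            return -np.inf
        mu = Xg.mean(axis=0)
        Sigma = np.cov(Xg, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        ll += multivariate_normal.logpdf(Xg, mu, Sigma).sum()
    return ll

def mutate(labels, G, rng, rate=0.05):
    """Reassign a random fraction of points to random clusters."""
    child = labels.copy()
    idx = rng.random(len(labels)) < rate
    child[idx] = rng.integers(0, G, idx.sum())
    return child

def crossover(a, b, rng):
    """Single-point crossover of two label vectors."""
    cut = rng.integers(1, len(a))
    return np.concatenate([a[:cut], b[cut:]])

def ea_cluster(X, G=3, pop=20, gens=100, seed=0):
    """Evolve hard cluster assignments; return the fittest label vector."""
    rng = np.random.default_rng(seed)
    population = [rng.integers(0, G, len(X)) for _ in range(pop)]
    for _ in range(gens):
        ranked = sorted(population, key=lambda z: fitness(X, z, G),
                        reverse=True)
        parents = ranked[: pop // 2]
        children = [
            mutate(crossover(parents[rng.integers(len(parents))],
                             parents[rng.integers(len(parents))], rng),
                   G, rng)
            for _ in range(pop - len(parents))
        ]
        population = parents + children
    return max(population, key=lambda z: fitness(X, z, G))
```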

    Mixtures of Variance-Gamma Distributions

    A mixture of variance-gamma distributions is introduced and developed for model-based clustering and classification. The latest in a growing line of non-Gaussian mixture approaches to clustering and classification, the proposed mixture of variance-gamma distributions is a special case of the recently developed mixture of generalized hyperbolic distributions, and a restriction is required to ensure identifiability. Our mixture of variance-gamma distributions is perhaps the most useful such special case and, we will contend, may be more useful than the mixture of generalized hyperbolic distributions in some cases. In addition to being an alternative to the mixture of generalized hyperbolic distributions, our mixture of variance-gamma distributions serves as an alternative to the ubiquitous mixture of Gaussian distributions, which is a special case, as well as to several non-Gaussian approaches, some of which are special cases. The mathematical development of our mixture of variance-gamma distributions model relies on its relationship with the generalized inverse Gaussian distribution; accordingly, the latter is reviewed before our mixture of variance-gamma distributions is presented. Parameter estimation is carried out within the expectation-maximization framework.
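
    The tractability here rests on the fact that a variance-gamma random variable is a normal mean-variance mixture with a gamma mixing variable. A minimal univariate sampler built on that representation is sketched below; the parameterization (and the identifiability restriction fixing the mixing mean) is illustrative rather than the paper's exact one.

```python
import numpy as np

def rvariance_gamma(n, mu, beta, sigma, shape, scale, seed=0):
    """Sample a univariate variance-gamma via its normal mean-variance
    mixture representation: Y = mu + beta*W + sqrt(W)*sigma*Z with
    W ~ Gamma(shape, scale). Parameterization is illustrative."""
    rng = np.random.default_rng(seed)
    W = rng.gamma(shape, scale, size=n)  # gamma mixing variable
    Z = rng.standard_normal(n)
    return mu + beta * W + sigma * np.sqrt(W) * Z
```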

    A Variational Approximations-DIC Rubric for Parameter Estimation and Mixture Model Selection Within a Family Setting

    Mixture model-based clustering has become an increasingly popular data analysis technique since its introduction over fifty years ago, and is now commonly utilized within a family setting. Families of mixture models arise when the component parameters, usually the component covariance (or scale) matrices, are decomposed and a number of constraints are imposed. Within the family setting, model selection involves choosing the member of the family, i.e., the appropriate covariance structure, in addition to the number of mixture components. To date, the Bayesian information criterion (BIC) has proved most effective for model selection, and the expectation-maximization (EM) algorithm is usually used for parameter estimation. In fact, this EM-BIC rubric has virtually monopolized the literature on families of mixture models. Deviating from this rubric, variational Bayes approximations are developed for parameter estimation, and the deviance information criterion (DIC) is used for model selection. The variational Bayes approach provides an alternative framework for parameter estimation by constructing a tight lower bound on the complex marginal likelihood and maximizing this lower bound by minimizing the associated Kullback-Leibler divergence. This approach is applied to the most commonly used family of Gaussian mixture models, and real and simulated data are used to compare the new approach to the EM-BIC rubric.
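
    For a feel of the variational side, the sketch below fits variational Gaussian mixtures over a small family of covariance structures and component counts using scikit-learn's BayesianGaussianMixture, which maximizes the evidence lower bound (ELBO). The paper selects models with the DIC; the ELBO is used here only as a convenient stand-in score.

```python
from sklearn.mixture import BayesianGaussianMixture

def vb_select(X, max_G=5, cov_types=("full", "tied", "diag", "spherical")):
    """Fit variational GMMs over a family of covariance structures and
    component counts; return the best (score, structure, G, model).
    The ELBO stands in for the paper's DIC here."""
    best = None
    for cov in cov_types:
        for G in range(1, max_G + 1):
            m = BayesianGaussianMixture(
                n_components=G, covariance_type=cov,
                max_iter=500, random_state=0).fit(X)
            score = m.lower_bound_  # final ELBO of the fit
            if best is None or score > best[0]:
                best = (score, cov, G, m)
    return best
```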

    Modelling Receiver Operating Characteristic Curves Using Gaussian Mixtures

    The receiver operating characteristic (ROC) curve is widely applied in measuring the performance of diagnostic tests. Many direct and indirect approaches have been proposed for modelling the ROC curve, and because of its tractability, the Gaussian distribution has typically been used to model both populations. We propose using a Gaussian mixture model, leading to a more flexible approach that better accounts for atypical data. Monte Carlo simulation is used to circumvent the absence of a closed-form expression for the resulting ROC curve. We show that our method performs favourably when compared to the crude binormal curve and to the semi-parametric frequentist binormal ROC estimated using the well-known LABROC procedure.
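
    A minimal sketch of the Monte Carlo construction, under assumed inputs: fit a Gaussian mixture to the scores of each population, simulate from both fitted models, and trace out empirical false- and true-positive rates over a grid of thresholds. The component count G and the threshold grid are illustrative choices.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_roc(scores_healthy, scores_diseased, G=2, n_mc=100_000, seed=0):
    """Monte Carlo ROC from Gaussian mixtures fit to each population's
    1-D diagnostic scores; the mixture ROC has no closed form."""
    f0 = GaussianMixture(G, random_state=seed).fit(
        scores_healthy.reshape(-1, 1))
    f1 = GaussianMixture(G, random_state=seed).fit(
        scores_diseased.reshape(-1, 1))
    s0, _ = f0.sample(n_mc)  # simulated healthy scores
    s1, _ = f1.sample(n_mc)  # simulated diseased scores
    thresholds = np.quantile(s0, np.linspace(0, 1, 201))
    fpr = (s0.ravel()[None, :] > thresholds[:, None]).mean(axis=1)
    tpr = (s1.ravel()[None, :] > thresholds[:, None]).mean(axis=1)
    return fpr, tpr
```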

    Mixture Model Averaging for Clustering

    In mixture model-based clustering applications, it is common to fit several models from a family and report clustering results from only the 'best' one. In such circumstances, selection of this best model is achieved using a model selection criterion, most often the Bayesian information criterion. Rather than throw away all but the best model, we average multiple models that are in some sense close to the best one, thereby producing a weighted average of clustering results. Two (weighted) averaging approaches are considered: averaging the component membership probabilities and averaging models. In both cases, Occam's window is used to determine closeness to the best model, and weights are computed within a Bayesian model averaging paradigm. In some cases, we need to merge components before averaging; we introduce a method for merging mixture components based on the adjusted Rand index. The effectiveness of our model-based clustering averaging approaches is illustrated using a family of Gaussian mixture models on real and simulated data.
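
    The membership-averaging idea can be sketched as follows: fit several models, keep those whose BIC falls within Occam's window of the best, convert BIC differences to Bayesian-model-averaging-style weights, and average the membership matrices. This toy version fixes the number of components so the matrices align, and it ignores the label-switching and component-merging issues that the paper handles via the adjusted Rand index; the window width is an illustrative value.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def averaged_memberships(X, G,
                         cov_types=("full", "tied", "diag", "spherical"),
                         window=20.0):
    """Average component membership probabilities over Gaussian mixture
    models whose BIC lies within Occam's window of the best model."""
    fits = [GaussianMixture(G, covariance_type=c, random_state=0).fit(X)
            for c in cov_types]
    scores = np.array([-m.bic(X) for m in fits])  # larger is better
    keep = scores >= scores.max() - window        # Occam's window
    w = np.exp(scores[keep] - scores[keep].max())
    w /= w.sum()                                  # BMA-style weights
    kept = [m for m, k in zip(fits, keep) if k]
    # n x G matrix of weighted-average membership probabilities
    return sum(wi * m.predict_proba(X) for wi, m in zip(w, kept))
```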

    Robust Clustering in Regression Analysis via the Contaminated Gaussian Cluster-Weighted Model

    The Gaussian cluster-weighted model (CWM) is a mixture of regression models with random covariates that allows for flexible clustering of a random vector composed of response variables and covariates. In each mixture component, it adopts a Gaussian distribution for both the covariates and the responses given the covariates. To robustify the approach with respect to possible elliptical heavy-tailed departures from normality, due to the presence of atypical observations, the contaminated Gaussian CWM is introduced here. In addition to the parameters of the Gaussian CWM, each mixture component of our contaminated CWM has a parameter controlling the proportion of outliers, one controlling the proportion of leverage points, one specifying the degree of contamination with respect to the response variables, and one specifying the degree of contamination with respect to the covariates. Crucially, these parameters do not have to be specified a priori, adding flexibility to our approach. Furthermore, once the model is estimated and the observations are assigned to the groups, a finer intra-group classification into typical points, outliers, good leverage points, and bad leverage points (concepts of primary importance in robust regression analysis) can be obtained directly. Relations with other mixture-based contaminated models are analyzed, identifiability conditions are provided, an expectation-conditional maximization algorithm is outlined for parameter estimation, and various implementation and operational issues are discussed. Properties of the estimators of the regression coefficients are evaluated through Monte Carlo experiments and compared to the estimators from the Gaussian CWM. A sensitivity study is also conducted based on a real data set.
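
    The distributional building block is the contaminated Gaussian: a two-component mixture of a "good" density and an inflated "bad" density sharing the same centre. A sketch of its log-density is below; the CWM applies this form separately to the covariates and to the responses given the covariates.

```python
import numpy as np
from scipy.stats import multivariate_normal

def contaminated_gaussian_logpdf(x, mu, Sigma, alpha, eta):
    """Log-density of a contaminated Gaussian: with probability alpha a
    'good' point from N(mu, Sigma), otherwise a 'bad' point from the
    inflated N(mu, eta*Sigma), with eta > 1."""
    good = multivariate_normal.logpdf(x, mu, Sigma)
    bad = multivariate_normal.logpdf(x, mu, eta * np.asarray(Sigma))
    return np.logaddexp(np.log(alpha) + good, np.log1p(-alpha) + bad)
```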

    Parsimonious Skew Mixture Models for Model-Based Clustering and Classification

    In recent work, robust mixture modelling approaches using skewed distributions have been explored to accommodate asymmetric data. We introduce parsimony by developing skew-t and skew-normal analogues of the popular GPCM family that employ an eigenvalue decomposition of a positive-semidefinite matrix. The methods developed in this paper are compared to existing models in both an unsupervised and a semi-supervised classification framework. Parameter estimation is carried out using the expectation-maximization algorithm, and models are selected using the Bayesian information criterion. The efficacy of these extensions is illustrated on simulated and benchmark clustering data sets.
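
    The parsimony device is the familiar volume-shape-orientation eigenvalue decomposition of each component scale matrix, with family members obtained by constraining pieces to be equal across components or to the identity. A sketch of the decomposition itself:

```python
import numpy as np

def gpcm_decompose(Sigma):
    """Decompose a scale matrix as Sigma = lam * Gamma @ Delta @ Gamma.T,
    with lam = det(Sigma)**(1/p) (volume), Gamma the eigenvectors
    (orientation), and Delta diagonal with unit determinant (shape):
    the eigen-decomposition the GPCM family constrains."""
    p = Sigma.shape[0]
    eigvals, Gamma = np.linalg.eigh(Sigma)
    lam = np.prod(eigvals) ** (1.0 / p)  # volume
    Delta = np.diag(eigvals / lam)       # shape; det(Delta) = 1
    return lam, Gamma, Delta
```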

    Clustering Airbnb Reviews

    Over the last decade, online customer reviews have increasingly influenced consumers' decisions when booking accommodation online. The renewed importance of word-of-mouth is reflected in the growing interest in investigating consumers' experiences by analyzing their online reviews through text mining and sentiment analysis. A clustering approach is developed for Boston Airbnb reviews submitted in the English language and collected from 2009 to 2016. This approach is based on a mixture of latent variable models, which provides an appealing framework for handling clustered binary data. We address the problem of discovering meaningful segments of consumers that are coherent in terms of both the underlying topics and the sentiment behind the reviews. A penalized mixture of latent traits approach is developed to reduce the number of parameters and to identify variables that are not informative for clustering. The introduction of component-specific rate parameters avoids the over-penalization that can occur when inferring a shared rate parameter on clustered data. The guests are divided into four groups: property-driven guests, host-driven guests, guests with a recent overall negative stay, and guests with some negative experiences.
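
    The paper's penalized mixture of latent traits is considerably richer than what fits in a few lines, but the basic shape of model-based clustering for binary review data can be sketched with a plain Bernoulli mixture fit by EM (all tuning values below are illustrative):

```python
import numpy as np

def bernoulli_mixture_em(X, G=4, n_iter=100, seed=0):
    """EM for a simple Bernoulli mixture on an n x d binary matrix
    (e.g., word presence per review). A much-simplified stand-in for
    the paper's penalized mixture of latent traits."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(G, 1.0 / G)
    theta = rng.uniform(0.25, 0.75, size=(G, d))  # P(word | cluster)
    for _ in range(n_iter):
        # E-step: responsibilities via log-sum-exp
        logp = (X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
                + np.log(pi))
        logp -= logp.max(axis=1, keepdims=True)
        z = np.exp(logp)
        z /= z.sum(axis=1, keepdims=True)
        # M-step: update mixing proportions and word probabilities
        Nk = z.sum(axis=0)
        pi = Nk / n
        theta = np.clip((z.T @ X) / Nk[:, None], 1e-6, 1 - 1e-6)
    return z.argmax(axis=1), pi, theta
```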

    A LASSO-Penalized BIC for Mixture Model Selection

    The efficacy of family-based approaches to mixture model-based clustering and classification depends on the selection of parsimonious models. Current wisdom suggests the Bayesian information criterion (BIC) for mixture model selection. However, the BIC has well-known limitations, including a tendency to overestimate the number of components as well as a proclivity for underestimating, often drastically, the number of components in higher dimensions. While the former problem might be soluble through merging components, the latter is impossible to mitigate in clustering and classification applications. In this paper, a LASSO-penalized BIC (LPBIC) is introduced to overcome this problem. This approach is illustrated based on applications of extensions of mixtures of factor analyzers, where the LPBIC is used to select both the number of components and the number of latent factors. The LPBIC is shown to match or outperform the BIC in several situations.
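
    The exact LPBIC formula is not reproduced here, but its flavour can be sketched: penalize the log-likelihood with an L1 term on the (mean) parameters and charge the BIC penalty only for parameters that survive the shrinkage, so sparse solutions are not over-penalized in high dimensions. The lam and tol values below are illustrative.

```python
import numpy as np

def penalized_bic(loglik, params, n, lam=1.0, tol=1e-8):
    """A generic LASSO-flavoured BIC (not the paper's exact LPBIC):
    L1-penalize the log-likelihood and count only non-negligible
    parameters toward the BIC penalty. Returned on a 'larger is
    better' scale."""
    params = np.asarray(params)
    k_eff = np.sum(np.abs(params) > tol)       # effective dimension
    pen_ll = loglik - lam * np.abs(params).sum()
    return 2 * pen_ll - k_eff * np.log(n)
```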

    Variational Bayes Approximations for Clustering via Mixtures of Normal Inverse Gaussian Distributions

    Parameter estimation for model-based clustering using a finite mixture of normal inverse Gaussian (NIG) distributions is achieved through variational Bayes approximations. Both univariate and multivariate NIG mixtures are considered. The use of variational Bayes approximations here is a substantial departure from the traditional EM approach and alleviates some of the associated computational complexities and uncertainties. Our variational algorithm is applied to simulated and real data. The paper concludes with discussion and suggestions for future work.
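
    As with the variance-gamma case above, the NIG distribution is a normal mean-variance mixture, here with an inverse-Gaussian mixing variable, and this representation underlies the estimation machinery. A minimal univariate sampler, under one common parameterization (gamma plays the role of sqrt(alpha**2 - beta**2) in the usual (alpha, beta, mu, delta) notation):

```python
import numpy as np

def rnig(n, mu, beta, delta, gamma, seed=0):
    """Sample a univariate normal inverse Gaussian via its mean-variance
    mixture representation: Y = mu + beta*W + sqrt(W)*Z with
    W ~ inverse-Gaussian(delta/gamma, delta**2)."""
    rng = np.random.default_rng(seed)
    W = rng.wald(delta / gamma, delta**2, size=n)  # IG(mean, shape)
    Z = rng.standard_normal(n)
    return mu + beta * W + np.sqrt(W) * Z
```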